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Ilya Nemenman/ Michael E Wall,^ and Charlie E Strauss^ 

^Departments of Physics and Biology, Emory University, Atlanta^ GA 30SS^ 
^Computer, Computational, and Statistical S'cienceld 
^Biosciences Division, Los Alamos National Laboratory, Los Alamos, NM 8754^ 

(Dated: February 10, 2015) 

We present an algorithm to estimate the configurational entropy S of a polymer. The algorithm 
uses the statistics of coincidences among random samples of confignrations and is related to the 
catch-tag-release method for estimation of population sizes, and to the classic “birthday paradox”. 
Bias in the entropy estimation is decreased by grouping configurations in nearly equiprobable par¬ 
titions based on their energies, and estimating entropies separately within each partition. Whereas 
most entropy estimation algorithms require N ~ 2^ samples to achieve small bias, our approach 
typically needs only N ~ \/^. Thus the algorithm can be applied to estimate protein free energies 
with increased accuracy and decreased computational cost. 

PACS numbers: 87.15.A-, 89.70.Cf 


Computational estimation of protein free-energy differ¬ 
ences (e.g., between ligand-bound and ligand-free states) 
is an unsolved problem with broad applications in molec¬ 
ular biology and medicinal chemistry. Present ap¬ 
proaches to the problem may be divided into two classes: 
difference methods, in which the difference in free energy 
between two states is directly estimated, and end-point 
methods, in which absolute free energies are calculated 
for the states being compared. Difference methods suf¬ 
fer from slow convergence when there is little overlap 
between the states, though it may be possible to over¬ 
come this limitation [T]. In contrast, end-point methods 
are independent of the overlap between the states being 
compared. They are also trivially more efficient when 
pairwise free-energy differences among a large number of 
states are required. The main challenge for end-point 
methods is overcoming difficulties in estimating configu¬ 
rational entropy, which contributes substantially to pro¬ 
tein free energy HIS]- Recent progress has followed sev¬ 
eral threads: calibration against reference potentials for 
which the free energy can be exactly calculated [4] , selec¬ 
tive sampling about local minima in the energy landscape 
[^, hierarchical estimation of chain-elongation transi¬ 
tion probabilities using Monte Carlo simulations [B] , and 
transforming the degrees of freedom to occupy minimally 
coupled subspaces [7] . While the calibration method uses 
only randomly sampled configurations and their corre¬ 
sponding energies, the others often require the ability to 
calculate the energy of an arbitrary configuration. 

Entropy estimation has been recognized as a crucial 
problem in other disciplines, such as computational neu¬ 
roscience and cell biology 019]. Properties of entropy 
estimators have been studied extensively [Mia. Bias, 
rather than variance, is the dominant problem. Eor com¬ 
mon estimators, the bias (SS) is negative and scales as 


(SS) = (5est(n) - ^true(p)) « (1) 


Here Stme is the true entropy of the unknown probability 
distribution p, dimp = K. Sest is the entropy estimated 
from the measured frequencies n, i'ii® 

averaging (...) is taken over the random samples. Both 
^true and S'est are measured in bits. In particular, the 
Maximum Likelihood (ML) estimator 

5ML(n) = -;^^log2^, (2) 

i=l 

which uses the observed frequencies instead of the un¬ 
known probabilities, has bias that scales as in Eq. Q. 
This sets the limit on data requirements for traditional 
configurational entropy estimation methods. 

When the asymptotic bias follows Eq. Q, it can be 
subtracted from the estimate, making the latter nearly 
unbiased for N ^ [TTl [1^. For N < 2'^*'"'’, univer¬ 
sally unbiased estimation is impossible [TBl [TI] . Then a 
priori assumptions about the underlying probability dis¬ 
tribution are needed to regularize the inference. One such 
Bayesian prior, 7^(p), is known as the NSB (Nemenman- 
Shafee-Bialek) method [14]. It has been useful in neuro¬ 
science, but to our knowledge has not yet been applied to 
macromolecular entropy estimation. The approach starts 
with noting that seemingly reasonable prior assumptions 
V{p) may result in unexpected assumptions 'P{S). For 
example, consider a family of Dirichlet priors over p, in¬ 
dexed by a parameter j3, 

^(pi/5)=^<5 

\ / i=i 

Here the first term normalizes V, and the ^-function 
constrains the normalization of the distribution p itself. 
The product of introduces biases towards peaked 

(/3 —>■ 0) or uniform (/3 —>■ oo) distributions p. Maxi¬ 
mum likelihood inference of p with this prior is equiva¬ 
lent to adding /3 pseudocounts to every possible outcome 
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i. Importantly, for large K, these pseudocounts bias the 
resulting Bayesian entropy estimator strongly [14]. The 
entropy becomes “known” before any samples are mea¬ 
sured! One sees this by calculating the a priori entropy 
expectation S'o(/3) and its rms error (5S'o(/3) at a fixed /3, 
and the latter turns out to be very small M- One then 
uses a new prior over p and /3, 

( 4 ) 

This is different from Eq. ^ by the Jacobian dSo{l3)/, 
which ensures that, in the limit of a narrow and mono¬ 
tonic a priori expectation of S'o(/3), one gets Pnsb(<S'o) = 
f d/3dpVNSB(p, /3)S(So(/3) — S(p)) « const. Ref. [T4] has 
argued that this procedure creates a more uniform P(S) 
and reduces the estimation bias. 

The NSB estimator is related to the familiar birthday 
problem: in a year with K days, one only needs N ^ 
but not N ^ K, individuals in a room to make 
it likely that there will be at least one shared birthday. 
The same idea is behind the catch-tag-release estimation 
of wildlife population sizes: in a pond with K fishes, one 
will catch a fish that has been previously caught, tagged, 
and released after N ~ '/K fishes caught. It follows that 
one can estimate K by counting how many previously 
tagged fishes were caught. If every fish has the same 
probability to be caught, then S = log 2 K, and both S 
and K can be estimated with N ~ '/K, compared to the 
usual methods that require N ^ , cf. Eq. Q. Such 

estimation of entropies based on coincidence counting is 
known as the Ma estimator m- No general estimator 
can function reliably with fewer samples: if one never 
sees a repeat fish, then one only knows the minimum 
population size, but nothing about the maximum. 

Unfortunately, since logarithms diverge near zero, the 
low-probability, poorly sampled tail of p contributes dis- 
proportionally to entropy. Thus coincidence counting 
cannot transfer easily to non-uniform probability distri¬ 
butions m- One needs to use the high-probability events 
(i.e., coincidences) and extrapolate to the tail. Then the 
prior V{p) may be seen as enforcing a certain shape of 
the tail, and NSB assumes that the tail is not too heavy 
[Him. When the tail structure has been guessed cor¬ 
rectly, the estimate, S'NSB(n), converges to the true en¬ 
tropy in the Ma regime, N ~ [T^. For mas¬ 

sive tails, typically there is bias, (JS'nsb) < 0. How¬ 
ever, |(i5S'nsb)| < To verify if a sample size de¬ 

pendent bias is present, one estimates S^sb{o:N), where 
0 < a < I is the fraction of data used. If the estimates 
at different a agree within the posterior error bars, the 
bias can be neglected compared to the variance [18]. 

We wondered whether NSB might improve estima¬ 
tion of configurational entropies of polymer chains. For 
this, we generated self avoiding random walks of different 
lengths on a 3D lattice USHII]- We focused largely on chi¬ 


ral walks on a cubic lattice, so that the n’th bond in the 
chain is allowed to take, at most, three of the five pos¬ 
sible orientations, depending on the orientations of the 
n— I’st and the n — 2’nd bonds. Such chains weave chi¬ 
ral paths through the lattice, approximating the c-alpha 
secondary structure of real proteins [52]. We considered 
lattices with bounding cubes up to 4x4x4 (64 total sites) 
and homopolymeric chains of length L < 50. With these 
constraints, all self avoiding configurations and their en¬ 
ergies can be enumerated on commodity computers. We 
then calculated the partition function by direct summa¬ 
tion. This makes further sampling of random configura¬ 
tions trivial and decouples, for presentation purposes, the 
entropy estimation problem from the problem of efficient 
sampling, which is not the focus of this Letter. To en¬ 
sure sufficient generality of our results, we explored chiral 
chains of lengths 16 < L < 50, as well as short non-chiral 
chains. The results were similar for these cases. Thus 
here we present only chiral polymers with L = 32. 

In the spirit of Ref. [23] , we evaluated the energy of lat¬ 
tice conformations using energy functions that include 
local and long-range contributions: (1) backbone sec¬ 
ondary structure (SS) propensity, measured by a pref¬ 
erence among the 3 chiral local configurations [23]; (2) 
an approximation of solvent exposure per residue [25] . 
measured by the number of vacant sites surrounding each 
occupied lattice site in a fold; and (3) pair contact en¬ 
ergy via a Go-like model for preferred contacts [ 20 ]. The 
Go model and SS propensities were derived by simulat¬ 
ing arbitrary chains and then arbitrarily selecting a chain 
representative of a good protein fold (high contact order, 
low solvent exposure, and low radius of gyration) [ 201 1501 - 
150 ] as a reference model. We then assigned energies to 
the other chains in proportion to their distance from the 
reference. We quantified contact order as the average 
distance in sequence of two residues contacting in a fold. 
Contact order is a key statistic predicting folding rate, 
with high values indicating that the fold’s nucleation re¬ 
quires distal regions to come into contact, making it de¬ 
pendent on meshing of long range side-chain forces in 
addition to local backbone propensities [ 21 ]. In the Go 
model, chains with the same secondary structure motifs 
(short range potential) or the same pairwise long range 
contacts (not necessarily with the same neighbors) as the 
reference model have the lowest energy. Since the rela¬ 
tive importance of the effects represented by the different 
energy functions is unknown a priori, we explored each 
energy function independently, arguing that if NSB works 
for each function, it will work for their combinations. 

The choice of the temperature T for the analysis is 
very important. Indeed, for T —0, the configurational 
distribution is dominated by a few highly probable con¬ 
figurations. The entropy is low, and hence it is easy to 
estimate, cf. Eq. 0 . For T —> 00 , all configurations are 
equiprobable, and the Ma estimator is unbiased when 
N ^ a/X, where K is now the total number of config- 
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FIG. 1. Configurational entropy estimation for the first 
energy function and T — 1 a,, u. NSB (thick line) and the 
grouping estimator with M = 2... 6 partitions are shown in 
comparison to 5'true and the ML estimator. We sampled up to 
N = 10® « 2^® from the Boltzmann distribution. Error bars 
correspond to one posterior standard deviation. For NSB and 
the grouping estimator, we bounded the number of configura¬ 
tions in each partition and in total as « 2.74^. The bias of ML 
can be inferred and subtracted out when logj N > Strue ~ 14, 
so that N > 10^, which corresponds to the bend in the ML 
curve [13]. However, both ML and NSB are biased for smaller 
N. In contrast, the grouping estimators is unbiased for M = 3 
in the Ma regime of logj N > S'true/2 « 7, or N > 10^. 


urations. Intermediate temperatures are the most inter¬ 
esting. Here entropy is too high for simple methods to 
work, while the Ma estimator cannot be used because 
configurations are not equiprobable. For our chiral poly¬ 
mers, we estimate numerically K « 2.74^, where 2.74 
replaces 3 due to self-avoidance and finite volume. Thus, 
for r —>• oo, iStrue ~ log 2 2.74^, which is about 46 bits for 
L = 32. We are most interested in entropies substantially 
smaller than this, but much larger than 1 bit. 

Typical results of applying NSB to samples from the 
protein configurational distribution for the first energy 
function are illustrated in Fig. at an intermediate tem¬ 
perature T = 1 a. u., when 5'true ~ 13.65 bits. By the 
time log 2 N ~ 5true/2 ~ 7, many coincidences have oc¬ 
curred. The estimator is reporting small posterior vari¬ 
ances, but it is biased, though always less than ML. NSB 
remains biased even when N ^ 2'^*''“'. The bias finally 
disappears only when even the naive ML estimator is 
nearly unbiased, N ^ that is, many samples per 

typical configuration. Similar failures are observed for 
different sequence lengths and the other two energy func¬ 
tions. The bias likely stems from the assumptions of NSB 
being incompatible with the data. 

As the temperature increases above T = 3 a. u. and 
the entropy grows beyond 5true ~ 18 bit, the bias of 
5nsb becomes small at log 2 N > 5true/2 « 9, see Fig. 
While the bias is nonzero, it is comparable to the stan¬ 
dard deviation, making the estimator useable. Thus long 
tails disappear from the distribution of configurations. 


FIG. 2. Gonfigurational entropy estimation for the first en¬ 
ergy function and T = 3 a. u. Same conventions are used 
as in Fig. Note that the bias is much less of a prob¬ 
lem for this higher entropy case for NSB. As before, the 
grouping estimator can be made unbiased in the Ma regime, 
log2 ^ ~ 5true/2 « 9, Or A > 10®. 

and the estimator works at < 40% of the maximum pos¬ 
sible entropy (46 bits) for this polymer! Since no general 
entropy estimator can work until log 2 N > 5true/2, in 
this regime, NSB performs nearly optimally. 

We can capitalize on the accurate performance of NSB 
for near-uniform distributions at high T. In the fish 
counting problem, basses, carps, and catfishes may have 
different probabilities of being caught, while the proba¬ 
bilities may be closer to uniform within the species. Thus 
counting each species separately will improve population 
estimates, but at the cost of needing a larger N to ensure 
that coincidences (catching a tagged fish) happen for each 
species. Similarly, suppose the space of possible protein 
configurations is split into partitions /i = 1,... ,M. 
Then by the grouping axiom for entropy, [30] 

M 

S{p) = + S{tv), (5) 

tj.=i 

where 5(i/^) stands for the entropy of the partition 
TT^ = Pi is the probability of a particular partition, 

and 5(7r) is the entropy of the partition choice. While the 
overall p may be incompatible with NSB, the estimator 
may perform better on each of the partitions separately, 
resulting in the new grouping estimator m- 

M 

5gr(n,M) = ^ + 5 ’nsb(0), (6) 

M 

(5^5'gr(n,M) 

+ (5^5nsb(0) 

M 

~ ^ + S‘^S-msb{(1>)- ( 7 ) 























4 


Here (p^ = ^^e empirical frequencies of each 

partition, and is the posterior variance. For Sgr(M) 
to be unbiased, the partitions should be chosen such 
that either (i) distributions of configurations within each 
partition are more uniform (allowing Ma’s arguments to 
work), or (ii) structure of tails within each partition is 
compatible with NSB. If a non-NSB estimator is used in 
the r. h. s. of Eq. (§, partitions should be chosen instead 
to make that estimator unbiased. 

Since each polymer configuration has an energy value 
that is known, and configurations with similar energies 
are nearly equiprobable, a natural partitioning exists in 
this context. We expect reduction in bias if one assigns a 
configuration with energy E to the partition /i, for which 
Emin + (Emax ~ Emin)(M ~ 1)/-^ ^ E < Emin + (Emax ~ 
Emin )/^/-^ I where Emin and Emax are the minimum and 
the maximum energy in a given sample. 

However, such partitioning comes at a cost. First, each 
of the terms in Eq. (§ has statistical errors. The er¬ 
rors add in quadratures, so that the estimator variance, 
(S^Sgr), typically grows with M. Second, when M K, 
S{Tr) approaches ^tme and becomes equally hard to esti¬ 
mate. Third, for M > 1, one needs coincidences in each 
partition. This requires more data, and would lead to 
{SSgr) > 0 if some of the partitions have no coincidences. 
Finally, one doesn’t know the maximum possible number 
of configurations Ki, in each partition, and has to take 
K,^ = K. Larger K results in a larger Ensb, though the 
dependence is weak [T7]. Thus (SSgr) > 0 for M 3> 1. 
Combined, these concerns indicate that success of the 
grouping estimator in polymer problems is uncertain. 

We tested the performance of the grouping estimator 
for different E, T, and energy functions. The results were 
consistent with the expectations and similar for all cases. 
As seen in Fig.[^ increasing the number of partitions first 
decreases the bias. (SSgr) is insignificant for M ^ 2 ... 4. 
For M so small, each partition is sampled well, and Sg^ 
works in the Ma regime. However, as M grows, the bias 
changes sign and increases again. Similar results hold for 
higher temperatures, Fig.[^ Here the bias is small for all 
M, and it is dramatically smaller than the ML bias. 

These results suggest a straight-forward algorithm for 
estimation of polymer configurational entropies. For a 
given sample of configurations and their energies, one 
computes Sg,-{aN, M) using Eq. Q and 6‘^SgT-{aN, M) 
for M = 1 using Eq. Q, while varying the fraction 
of the data used for the estimation, 0 < a < 1. One 
looks for the sample size dependent bias by verifying if 
Sgr{aN, 1) drifts by more than the standard deviation as 
a increases. If the bias is positive, the algorithm cannot 
be applied (this has never happened in our tests). If the 

bias is insignificant, then Sgr{N, 1) ± ((5^S'gr(iV, 1)) is 
the entropy estimate. If the bias is negative, then one 
increments M —>■ M + 1, and repeats the estimation for 
various a. One increments M until it reaches M*, such 


that 6Sgr{N, M*) > 0. The best estimate, the bias, and 
the variance are then the means of the corresponding 
quantities for Sgr(N, M* — I) and Sgr{N, M*). Crucially, 
unless M* ^ 1, coincidences are present in all partitions. 
Thus the proposed estimator will work in the Ma regime, 
log 2 N ~ Strue/S, providing a square root data require¬ 
ment reduction compared to simpler approaches. Since 
coincidences are required for any estimator to work, it is 
unlikely that other general purpose estimators will sub¬ 
stantially outperform the NSB-based grouping algorithm. 

For off-lattice polymers, the entropy can be computed 
by enumerating local minima in the energy landscape 
and additionally estimating the entropy within each such 
basin of attraction [32] ■ For the latter, there are good 
methods for entropy estimation based on kernel smooth¬ 
ing or nearest neighbor techniques |33| |31j; we expect 
that these also will be improved by grouping. Alterna¬ 
tively, the entropy in a local basin may be estimated ana¬ 
lytically using the normal modes approximation [32] . We 
expect the present version of the NSB algorithm with 
grouping to be especially useful for the former, that is 
for calculating contributions to entropy from many sim¬ 
ilar local minima, which are observed for rugged energy 
landscapes that are characteristic of real proteins [55] . 

In summary, in this Letter, we have verified that the 
grouping generalization of the NSB algorithm can be used 
to produce reliable estimates of configurational entropies 
of polymer chains in the severely undersampled regime 
N ~ using only random samples of configura¬ 

tions and their corresponding energies. The estimator is 
available from http;//nsb-entropy, sourceforge.net 
as a C-|—I- and Matlab/Octave code. The estimation 
is rapid on modern computers. For lattice polymers 
of length ~ 30, it requires only ~ 1000 configuration 
samples. Thus the square-root scaling suggests that the 
method will be able to work with sequences of previously 
unaccessible lengths. 
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